A novel gear RUL prediction method by diffusion model generation health index and attention guided multi-hierarchy LSTM

Gears, as indispensable components of machinery, demand accurate prediction of their Remaining Useful Life (RUL). To enhance the utilization of ordered information within time series data and elevate RUL prediction precision, this study introduces the attention-guided multi-hierarchy LSTM (AGMLSTM). This innovative approach leverages attention mechanisms to capture the intricate interplay between high and low hierarchical features of the input data, marking the first application of such a technique in gear RUL prediction. Additionally, a refined health indicator (HI) is introduced, constructed through a diffusion model, to precisely reflect the gears' health condition. The proposed RUL prediction method unfolds as follows: firstly, HIs are computed from gear vibration data. Subsequently, leveraging the known HIs, AGMLSTM predicts future HIs, and the RUL of the gear is determined upon surpassing the failure threshold. Quantitative analysis of experimental results conclusively demonstrates the superiority of the proposed RUL prediction method over existing approaches for gear RUL estimation.

1. One is that existing prediction methods cannot fully and reasonably mine the ordered information of HIs, which weakens the feature extraction ability of the models and reduces RUL prediction accuracy. 2. Another is that little work has addressed the construction of HIs with clear degradation trends and stable failure thresholds.
To face these challenges, the article proposes a novel attention-guided multi-hierarchy LSTM (AGMLSTM) model. AGMLSTM not only mines the features of mixed hierarchies but is also reasonably guided by the attention mechanism, which makes it more suitable for gear RUL prediction. In addition, a suitable health index (HI) benefits RUL prediction accuracy. In the paper, a novel HI constructed by the diffusion model is presented, which is smooth and has a clear trend. Finally, based on the known HIs, AGMLSTM predicts the future HIs step by step until they exceed the preset failure value, and the RUL of the gear is obtained. The advantage of the presented RUL approach is illustrated by the quantitative evaluation of various indexes in the experiments. Particularly noteworthy is the 92% RUL prediction accuracy achieved in the challenging task of predicting gear RUL within one hour, which indicates the practical significance of the approach for online RUL prediction.
The main contributions of the article are as follows: 1. The diffusion model is adopted to construct the HIs for gears, which effectively mitigates fluctuations. The gear HI curves exhibit declining trends, and their failure thresholds are similar. 2. AGMLSTM is proposed for gear RUL prediction. This method demonstrates enhanced capability in extracting ordered information, improving feature extraction, and boosting RUL estimation. 3. Building on the diffusion model and AGMLSTM, the study proposes a novel prediction method, validated through comprehensive assessments of full-life vibration data for gears. The remainder of the article is arranged as follows. "Theoretical basis" introduces the concepts of the diffusion model and LSTM. The details of the proposed methods are described in "The proposed methodology". The experiments and the analysis of results are given in "Experimental analysis". Last, "Conclusion" summarizes the work.

Theoretical basis

Diffusion model
The diffusion model 25 is an advanced deep generative model. It gradually transforms data into noise in the forward direction and then learns the de-noising process in the backward direction to generate new samples. The learned de-noising module of the diffusion model is therefore adopted to construct gear HIs. Figure 1 illustrates the intuition behind the diffusion model.
In this study, the de-noising diffusion probabilistic model is employed, which operates through two Markov chains. The diffusion model adopts a progressive noising and de-noising approach. In the forward process, Gaussian noise is gradually added to the original data, step by step, until the data is transformed into a simple prior Gaussian distribution. In the reverse process, the noise is gradually eliminated by a deep neural network. The fixed approximate posterior $q(x_{1:N} \mid x_0)$ of the forward stage is calculated in Eqs. (1) and (2), where $\beta_n \in (0, 1)$, $N$, and $I$ are the variance of the added Gaussian noise, the number of diffusion steps, and the identity matrix. In the reverse process, another Markov chain with learnable Gaussian transitions $p_\theta(x_{n-1} \mid x_n)$, beginning at $p(x_N)$, constructs the joint distribution, as calculated in Eqs. (3) and (4), where the mean $\mu_\theta$ and variance $\delta_\theta$ are obtained from a deep neural network.
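Eqs. (1) to (4) are referenced above but are not reproduced in this text. As a point of reference only, the standard de-noising diffusion probabilistic model formulation, written with the symbols defined above, takes the following form. It is assumed here, not confirmed, that the article's equations match it.

```latex
\begin{align}
q(x_{1:N} \mid x_0) &= \prod_{n=1}^{N} q(x_n \mid x_{n-1}), \\
q(x_n \mid x_{n-1}) &= \mathcal{N}\!\left(x_n;\ \sqrt{1-\beta_n}\,x_{n-1},\ \beta_n I\right), \\
p_\theta(x_{0:N}) &= p(x_N) \prod_{n=1}^{N} p_\theta(x_{n-1} \mid x_n), \\
p_\theta(x_{n-1} \mid x_n) &= \mathcal{N}\!\left(x_{n-1};\ \mu_\theta(x_n, n),\ \delta_\theta(x_n, n)\,I\right).
\end{align}
```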
The objective of the reverse Markov chain, i.e., computing $p_\theta(x_{n-1} \mid x_n)$, is to remove the Gaussian noise introduced during the forward process. To generate $x_0$, a sample is drawn from the noise distribution $p(x_N)$ and the transition $p_\theta(x_{n-1} \mid x_n)$ is applied repeatedly until $n = 1$.
For accurate sampling, the trained reverse Markov chain $p_\theta(x_{n-1} \mid x_n)$ should be close to the posterior distribution $q(x_{n-1} \mid x_n, x_0)$ of the forward process given $x_0$. The Kullback-Leibler (KL) divergence is chosen as the similarity metric, as defined in Eq. (5), where $C$ is a constant that is independent of $\theta$ and $\mu_n$ represents the mean of $q(x_{n-1} \mid x_n, x_0)$. The simplified objective is calculated in Eq. (6) by introducing the noise-prediction neural network $\varepsilon_\theta$ with parameters $\theta$, where the coefficient of the noise term is a positive weighting function of $n$.
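The forward noising step and the simplified noise-prediction objective can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the article's implementation: the linear noise schedule, the step count, the function names, and the sine-wave stand-in for a vibration segment are all hypothetical, and the loss is shown without the positive weighting function.

```python
import numpy as np

def make_schedule(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice): beta_n and the cumulative product of (1 - beta_n)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def forward_noise(x0, n, alpha_bars, rng):
    """Sample x_n from q(x_n | x_0) in closed form; n is a 1-based step index."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[n - 1]
    xn = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return xn, eps

def simplified_loss(eps_true, eps_pred):
    """Unweighted noise-prediction objective: mean squared error between true and predicted noise."""
    return float(np.mean((eps_true - eps_pred) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = np.sin(np.linspace(0, 4 * np.pi, 256))  # hypothetical stand-in for a vibration segment
xn, eps = forward_noise(x0, n=50, alpha_bars=alpha_bars, rng=rng)
loss_perfect = simplified_loss(eps, eps)  # a perfect noise predictor gives zero loss
```

In training, `eps_pred` would come from the noise-prediction network evaluated at `xn` and step `n`; here the true noise is passed in only to show the shape of the computation.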

Long short term memory
LSTM 13 was proposed to relieve the limitations of recurrent networks through gated nonlinear processing of the data, as shown in Fig. 2. In the mathematical expression of LSTM, Eqs. (7-12), the input weight matrices $w_{ix}$ ($w_{fx}$, $w_{ox}$, $w_{cx}$) and the recurrent weight matrices $w_{ih}$ ($w_{fh}$, $w_{oh}$, $w_{ch}$) define the nonlinear transformation of $x_t$ and $h_{t-1}$ in the input (forget, output) gate, which decides the degree to which data in the hidden layer is input (forgotten, output); $b_i$ ($b_f$, $b_o$, and $b_c$) are the biases of the hidden layer. $\tilde{c}_t$ and $c_t$ are the internal state and memory state of the cell; $\odot$ denotes pointwise multiplication; $\sigma$ (tanh) is the sigmoid (tanh) activation function.
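One time step of a standard LSTM cell, using the weight and bias names introduced above, can be sketched as follows. This is the textbook LSTM formulation and is assumed to correspond to Eqs. (7-12); the dictionary layout, the initialization, and the layer sizes are illustrative choices only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps names such as 'w_ix', 'w_ih', 'b_i' to arrays."""
    i_t = sigmoid(p["w_ix"] @ x_t + p["w_ih"] @ h_prev + p["b_i"])    # input gate
    f_t = sigmoid(p["w_fx"] @ x_t + p["w_fh"] @ h_prev + p["b_f"])    # forget gate
    o_t = sigmoid(p["w_ox"] @ x_t + p["w_oh"] @ h_prev + p["b_o"])    # output gate
    c_hat = np.tanh(p["w_cx"] @ x_t + p["w_ch"] @ h_prev + p["b_c"])  # internal (candidate) state
    c_t = f_t * c_prev + i_t * c_hat                                  # memory state
    h_t = o_t * np.tanh(c_t)                                          # hidden output
    return h_t, c_t

def init_params(n_in, n_hidden, rng):
    """Small random weights and zero biases for the four gate/candidate blocks."""
    p = {}
    for g in "ifoc":
        p[f"w_{g}x"] = 0.1 * rng.standard_normal((n_hidden, n_in))
        p[f"w_{g}h"] = 0.1 * rng.standard_normal((n_hidden, n_hidden))
        p[f"b_{g}"] = np.zeros(n_hidden)
    return p

rng = np.random.default_rng(0)
params = init_params(n_in=4, n_hidden=3, rng=rng)
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.standard_normal((5, 4)):  # five time steps of a 4-dimensional input
    h, c = lstm_step(x_t, h, c, params)
```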

Attention-guided multi-hierarchy LSTM
ON-LSTM was first proposed in the NLP field to address the hierarchical structure problem, i.e., "characters, words, and phrases" belong to different hierarchies and should be learned in different ways. However, for the vibration signal of mechanical equipment, it is difficult to give physical meaning to the hierarchy of ordered information. During training, ON-LSTM forms hierarchies automatically, with feedback provided only through the error between predicted and actual results, so the hierarchical process lacks effective guidance and clear physical interpretation. Moreover, the ordered information extracted by ON-LSTM exhibits mixed regions, and the features missed in mixed regions may impair the feature extraction capability. Therefore, this study proposes a new attention-guided multi-hierarchy Long Short-Term Memory (AGMLSTM) neural network that further partitions the mixed hierarchies using the attention mechanism, thereby forming an attention-guided multi-hierarchy information structure. The similarity between the elements of the input and recurrent vectors and the attention labels determines the segmentation point between input hierarchies and historical hierarchies, which is the index of the element most similar to the attention labels. In this way, attention guides the hierarchical segmentation and gives physical meaning to the hierarchy of ordered information in vibration data. At the same time, the multi-hierarchy partitioning enables the neural network to fully utilize ordered information. Information that is easily retained over a long period is assigned a high attention hierarchy, while information that is easily replaceable is assigned a low attention hierarchy. The intermediate zones (low intermediate, intermediate, and high intermediate) will be zero when the high and low attention hierarchy information has no interaction. In this case, the information in the zone does not participate in the neural network's update process.
Let $x_t = [x_{t,1}, x_{t,2}, \dots, x_{t,n}]^T$ and $h_{t-1} = [h_{t-1,1}, h_{t-1,2}, \dots, h_{t-1,m}]^T$ denote the input HIs at time step $t$ and the recurrent data at time step $t-1$, respectively. Compared to other networks, the main difference of AGMLSTM lies in the hierarchical information partitioning during the cell unit update process, as illustrated in Fig. 3. The proposed AGMLSTM uses attention-guided multi-hierarchy partitioning influenced by attention labels. By calculating the similarity between the input data, the recurrent data, and the attention label, the element with the maximum attention coefficient is identified as the hierarchy segmentation point, so that the model identifies the hierarchy from the largest element to the element that is most similar to the label. Thus, the designed hierarchical structure can be combined with an RNN through the attention hierarchies of information. Under the designed update rules, information with a lower attention hierarchy is more prone to being forgotten, while information with a higher attention hierarchy is preserved for a longer duration.
Because the information is organized into multiple hierarchies, let the main and auxiliary hierarchical positions of the input information $x_t$ be denoted as $d^1_{t,i}$ and $d^2_{t,i}$, respectively, and the main and auxiliary hierarchical positions of the historical information $h_{t-1}$ as $d^1_{t,f}$ and $d^2_{t,f}$. These positions are generated by the construction functions $F_1$, $F_2$, $F_3$, and $F_4$, guided by the query vector $q_m$. The auxiliary hierarchical positions are used to refine the interval of hierarchical mixing.
The memory cell state vector is updated according to certain rules based on the attention hierarchy of input information and recurrent information.
When $d^1_{t,f} \le d^1_{t,i}$, the main hierarchy of the input information $x_t$ is higher than the main hierarchy of the historical information $h_{t-1}$, resulting in an intermediate attention hierarchy. AGMLSTM further refines the intermediate attention hierarchy and divides it into three sub-hierarchies: the low intermediate, intermediate, and high intermediate attention hierarchies, shown in Fig. 4. When the hierarchical relationship simultaneously satisfies $d^2_{t,f} \le d^2_{t,i}$, the auxiliary hierarchy of the input information $x_t$ is also higher than the auxiliary hierarchy of the historical information $h_{t-1}$, and there is an interactive space between $d^2_{t,f}$ and $d^2_{t,i}$. The cell unit update rules are as follows: within the cell unit interval $[0, d^1_{t,f}]$, the candidate memory cell state vector $\tilde{c}_t$ is directly input into the corresponding memory cell, while within the cell unit interval $[d^1_{t,i}, d_{max}]$, the memory cell state vector $c_{t-1}$ from the previous time step is directly input into the corresponding memory cell. As for the overlapping region $[d^1_{t,f}, d^1_{t,i}]$, further refinement updates are performed based on the auxiliary hierarchical positions of the input and historical information. For the overlapping region $[d^1_{t,f}, d^2_{t,f}]$, the update rule uses $1 - s_1$, the scale of short-term information in the cell memory in this case.
For the overlapping region $[d^2_{t,f}, d^2_{t,i}]$, $c_t$ has its own update rule, and for the overlapping region $[d^2_{t,i}, d^1_{t,i}]$, $c_t$ is updated with $1 - s_2$ representing the long-term data ratio. Under this hierarchy distribution, $c_t$ is given by combining these piecewise rules.
When the hierarchical relationship instead satisfies $d^2_{t,f} \ge d^2_{t,i}$, the auxiliary hierarchical level of the input information $x_t$ is lower than the auxiliary hierarchical level of the historical information $h_{t-1}$, and there is no interactive space between $d^2_{t,f}$ and $d^2_{t,i}$, as shown in Fig. 5. In this case, the update mechanisms within the index ranges $[d^1_{t,i}, d_{max}]$ and $[0, d^1_{t,f}]$ remain consistent with the first case. Within the index range $[d^1_{t,f}, d^2_{t,i}]$, the update of $c_t$ uses $1 - s_1$, the short-term information ratio, and for elements in the other mixed range, $1 - s_2$ denotes the long-term information ratio. In summary, $c_t$ at this hierarchy is updated by combining these rules.
In the remaining case, there are no overlapping cell unit regions: the main hierarchical positions of the input information $x_t$ and the historical information $h_{t-1}$ are separated by the interval $[d_{t,i}, d_{t,f}]$, and the attention focuses more on the input data than on the recurrent data. Therefore, within the intermediate attention level, $c_t$ does not need the mix of short-term and mid-term memory $f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ to update. Within the cell unit interval $[d_{t,i}, d_{t,f}]$, the current time step's cell activation vector is set to zero; $\tilde{c}_t$ is the direct input within the cell unit interval $[0, d_{t,i}]$, and $c_{t-1}$ is the direct input within the interval $[d_{t,f}, d_{max}]$. In this situation, $c_t$ is updated by Eq. (24), with its hierarchical partition shown in Fig. 6.
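The non-overlapping case is simple enough to write out directly. The sketch below is a hedged illustration of the piecewise rule stated in the text, not the article's Eq. (24) itself: the function name and the half-open index ranges (chosen so that no unit is assigned twice at a boundary) are assumptions.

```python
import numpy as np

def update_no_overlap(c_hat, c_prev, d_i, d_f):
    """Cell update when the input and historical hierarchies do not overlap (d_i <= d_f).

    Units [0, d_i) take the candidate state, units [d_i, d_f) are set to zero,
    and units [d_f, d_max) keep the previous memory state.
    """
    if not 0 <= d_i <= d_f <= len(c_hat):
        raise ValueError("expected 0 <= d_i <= d_f <= d_max")
    c_t = np.zeros_like(c_hat)
    c_t[:d_i] = c_hat[:d_i]    # low hierarchy: overwritten by the current input
    c_t[d_f:] = c_prev[d_f:]   # high hierarchy: carried over from the previous step
    return c_t

c_hat = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])         # hypothetical candidate state
c_prev = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])  # hypothetical previous state
c_t = update_no_overlap(c_hat, c_prev, d_i=2, d_f=4)
# c_t is [1, 2, 0, 0, 50, 60]
```

The overlapping cases additionally blend `c_hat` and `c_prev` inside the mixed regions using the ratios $s_1$ and $s_2$; those blends are not sketched because their equations are not available in this text.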
The construction functions $F_1$, $F_2$, $F_3$, and $F_4$ are derived as follows. We first normalize the input data $x_t$ and the historical data $h_{t-1}$ using the softmax function, introducing four m-dimensional vectors, where $w_f$ and $w_i$ represent the weight matrices of the softmax layers for the historical data and input data, respectively, while $b_f$ and $b_i$ represent the thresholds of the softmax layers for the historical data and input data.
Next, the attention coefficients for the input data ($\alpha^1_{t,i}$ and $\alpha^2_{t,i}$) and for the recurrent data are calculated using Eqs. (29-32), respectively. During the training process, the query vector $q_{t,m}$ at time step $t$ is set as $x_{t+1,n}$, while during the inference process, it is set as $x_{t,n}$.
The four scoring functions compare each of the four normalized vectors with the query vector $q_{t,m}$. The positions of the maximum attention coefficients of the input data are set as the main and auxiliary hierarchical positions $d^1_{t,i}$ and $d^2_{t,i}$ of the input information $x_t$, and the positions of the maximum attention coefficients of the recurrent data are set as the main and auxiliary hierarchical positions $d^1_{t,f}$ and $d^2_{t,f}$ of the historical information $h_{t-1}$, where index() denotes the element position extraction function.
To achieve the automatic hierarchical update described above, the cumulative sum function cumsum() is used to compute the cumulative sums of the attention coefficients, resulting in the main and auxiliary input gates $i^1_t$ and $i^2_t$, as well as the main and auxiliary forget gates $f^1_t$ and $f^2_t$. The attention hierarchy structure is then partitioned using these gates. Finally, the propagation equation of AGMLSTM is written from the above equations, where the other parameters are the same as in LSTM.
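The two building blocks named above, a position taken at the maximum attention coefficient and a gate formed by a cumulative sum, can be illustrated in isolation. This is only a sketch in the spirit of the cumax operation used by ON-LSTM; the scores are hypothetical, and how AGMLSTM combines these quantities into its main and auxiliary gates is not recoverable from this text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cumax(z):
    """Cumulative sum of a softmax: a monotone soft step that rises to 1."""
    return np.cumsum(softmax(z))

def hierarchy_position(alpha):
    """Index of the largest attention coefficient, used as a segmentation point."""
    return int(np.argmax(alpha))

scores = np.array([0.1, 0.2, 4.0, 0.3, 0.1])  # hypothetical attention scores
alpha = softmax(scores)                       # attention coefficients
d = hierarchy_position(alpha)                 # segmentation index (2 for these scores)
gate = cumax(scores)                          # soft gate whose largest jump sits at index d
```

Because the gate is a cumulative sum of non-negative coefficients, it is non-decreasing and ends at 1, so units above the segmentation point receive values near 1 and units below it values near 0.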

RUL prediction approach
A health indicator (HI) that accurately shows the degradation process of gears is crucial to the performance of the prediction model. Therefore, the HI of the vibration signal obtained by the trained diffusion model, whose superiority has been demonstrated, is used in the article for gear RUL prediction. Considering that most DL approaches for gear RUL prediction are pattern recognition methods, which are influenced by the quantity and quality of data, an RUL prediction approach under limited samples 20 is used in the article. Its flowchart is shown in Fig. 7 and the details are as follows:
1. The HI data $z = [z_1, z_2, \dots, z_{n-1}, z_n]$ is calculated from the full-lifecycle vibration data by a sampling approach whose sampling time is $T$ and sample interval is $t$.
2. The first part $z' = [z_1, z_2, \dots, z_{m-1}, z_m]$ of $z$ is chosen and linearly normalized.
3. The matrix $G$, with input $[G_1, \dots, G_l]^T$ and output $G_{l+1}$, is reconstructed from the normalized data, where the value of $l$ is equal to the number of neurons in the input layer and $G_i$ is the $i$-th row of $G$.
4. The training loss $L$ of the proposed model, Eq. (45), is the mean square error (MSE) between the last row $G_{l+1}$ and the prediction $\hat{G}_{l+1}$ obtained from the first $l$ rows of the matrix $G$, where $f$ denotes the model mapping function and $w$, $b$, and $s$ denote the learnable matrices.
5. After the model is trained, the last $l$ HIs are set as the model input to estimate the HI at the next point. The step-by-step prediction is then executed.
6. Finally, once the inversely normalized predicted HIs exceed the failure threshold, the estimated RUL is obtained by Eq. (57), where $n_1$ is the number of predicted HI points before exceeding the threshold. Comparison with the actual RUL shows the effectiveness of the proposed method.
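The windowing and step-by-step prediction in the steps above can be sketched as follows. This is an illustration under stated assumptions rather than the article's procedure: normalization is omitted, windows are stored as rows instead of the column layout of the matrix $G$, a linear extrapolation stands in for the trained AGMLSTM, and the RUL is taken as $n_1$ times the recording interval, which is an assumption consistent with 60 HI points corresponding to a one-hour RUL in the experiments.

```python
import numpy as np

def build_windows(z, l):
    """Return inputs of shape (m - l, l) and targets of shape (m - l,).

    Each input row holds l consecutive HIs; the target is the HI that follows.
    """
    z = np.asarray(z, dtype=float)
    if len(z) <= l:
        raise ValueError("need more HI points than the window length")
    X = np.stack([z[i:i + l] for i in range(len(z) - l)])
    y = z[l:]
    return X, y

def predict_rul(known, l, model, threshold, interval_s, max_steps=1000):
    """Roll a one-step model forward until the predicted HI exceeds the threshold.

    Returns (rul_seconds, predicted_points); rul_seconds is None if the
    threshold is never exceeded. For an HI that declines towards its
    threshold, the comparison below would be reversed.
    """
    window = [float(v) for v in known[-l:]]
    predicted = []
    for _ in range(max_steps):
        nxt = float(model(np.array(window)))
        if nxt > threshold:
            n1 = len(predicted)          # points predicted before exceeding the threshold
            return n1 * interval_s, predicted
        predicted.append(nxt)
        window = window[1:] + [nxt]      # feed the prediction back as input
    return None, predicted

def linear_extrapolation(window):
    """Hypothetical stand-in for the trained predictor."""
    return window[-1] + (window[-1] - window[-2])

known_hi = [0.0, 0.1, 0.2, 0.3]
X, y = build_windows(known_hi, l=2)
rul_s, future = predict_rul(known_hi, l=2, model=linear_extrapolation,
                            threshold=0.65, interval_s=60)
# three points (about 0.4, 0.5, 0.6) are predicted before the threshold is exceeded
```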

Model optimization
The configuration of the predictive model is explored by grid search. The hyper-parameters, namely the candidate learning rates α and the neuron numbers in each layer, form the grid nodes, which are searched for the parameters with the best predictive performance.
The weight matrix w, the bias matrix b, and the proportion matrix s of the model are trained during the training stage based on the loss function Eq. (55) and updated according to Eq. (58) by the Adam optimizer.
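A grid search of this kind amounts to evaluating every combination of candidate values and keeping the best one. The sketch below is a generic illustration; the candidate lists and the `toy_validation_error` function are hypothetical stand-ins for training the model and measuring its validation error.

```python
import itertools

def grid_search(evaluate, learning_rates, hidden_sizes):
    """Evaluate every grid node and return (best_error, learning_rate, hidden_size)."""
    best = None
    for lr, n_hidden in itertools.product(learning_rates, hidden_sizes):
        error = evaluate(lr, n_hidden)
        if best is None or error < best[0]:
            best = (error, lr, n_hidden)
    return best

def toy_validation_error(lr, n_hidden):
    """Hypothetical stand-in for training a model and returning its validation error."""
    return (lr - 0.02) ** 2 + ((n_hidden - 35) / 100) ** 2

best_error, best_lr, best_hidden = grid_search(
    toy_validation_error,
    learning_rates=[0.01, 0.02, 0.03, 0.05],
    hidden_sizes=[20, 35, 50],
)
```

The cost grows with the product of the list lengths, so each added hyper-parameter multiplies the number of training runs.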

Experimental analysis
Several full-life fatigue experiments were executed on a gear contact fatigue test rig to investigate the lifespan of gears from normal conditions to failure (tooth breakage and pitting). The material of the gears for the tooth fracture case was 40Cr, while the gear material for the pitting case was 20CrMnTi. The gear module was set to 5, and the experimental gear case had an oil flow rate of 4 L/h with a cooling temperature of 70 °C. The gears that experienced tooth breakage failures (Dataset 1 and Dataset 2) had tooth counts of 31, 25, 25, and 31, respectively. The gears that suffered from pitting failures (Dataset 3 and Dataset 4) had tooth counts of 26, 24, 24, and 26, as shown in Table 1.
As depicted in Fig. 8, the experimental setup comprises a torque controller, a cooling and lubrication controller, an experimental operation platform, and a gear operation platform. The sampling frequency of the experimental setup is fixed at 50,000 Hz. To minimize data volume, the recording interval and the sampling length are set to 60 s and 10 s, respectively, and part of the healthy-state data at the beginning of the run is deleted. Data sets 1 and 3 are used to train the diffusion model for calculating gear HIs. Then, the trained diffusion model is used to encode the health indicator points of data sets 2 and 4. To test the prediction ability of the predictive model, this study conducts experiments using the health indicator points from all data sets. Through grid search, optimal hyper-parameters for the AGMLSTM are obtained. For data sets 1, 3, and 4, the numbers of neurons in the input, hidden, and output layers of AGMLSTM are set to 100, 35, and 1. For data set 2, they are set to 60, 20, and 1. The learning rates for the models on data sets 1, 2, 3, and 4 are set to 0.02, 0.03, 0.05, and 0.05.
Appropriate health indicators can effectively reflect the health condition of mechanical equipment and improve the RUL prediction capability 26-30. Single features such as root mean square, kurtosis, and frequency centroid are limited and may not adequately capture the degradation trend of mechanical equipment in most data sets. Therefore, this study develops a health indicator based on the diffusion model that can be used in most cases. Since the signals collected during the steady-state phase contain less degradation information, only a portion of the samples from the lifecycle data set is used to calculate the health indicator points using the diffusion model, and these points are then applied to remaining useful life prediction. Figure 9 displays the obtained health indicator points for all four gear data sets. The constructed health indicator curves effectively reflect the degradation trend of gear health, which is highly beneficial for RUL prediction. All gear health indicator curves exhibit a declining trend, and their failure thresholds are similar. This aids in setting a unified failure threshold for different experimental setups, thereby enhancing the robustness of gear RUL prediction.
The study carried out comparative experiments with different optimization algorithms to examine the performance of the chosen optimizer. Specifically, SGDM 31, RMSprop 32, and Adam were compared within the same model structure, and each model was tuned separately. The evaluation involved ten parallel experiments for each model on a one-hour prediction task. Model performance was assessed using the mean absolute error (MAE), the normalized root mean square error (NRMSE), the mean absolute percentage error (MAPE), and Score 23, as presented in Fig. 10.
It can be concluded that the model trained with Adam has the lowest values of MAE, NRMSE, and MAPE, and the highest Score value. This means that with the Adam optimizer, the proposed method has better RUL prediction performance. Thus, Adam is more suitable for the proposed method when it deals with gear RUL prediction.
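For readers who want to reproduce the error metrics, three of the four indicators named above have common definitions that can be sketched directly. The NRMSE normalization (by the range of the true values) is an assumption, since several conventions exist and the article's choice is not stated here; Score is defined in reference 23 and is not reproduced. The sample values are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def nrmse(y_true, y_pred):
    """Root mean square error divided by the range of the true values (assumed convention)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / (y_true.max() - y_true.min()))

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; assumes no true value is zero."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

rul_true = [10.0, 20.0, 30.0, 40.0]  # hypothetical actual RULs
rul_pred = [12.0, 18.0, 33.0, 37.0]  # hypothetical predicted RULs
metrics = {
    "MAE": mae(rul_true, rul_pred),
    "NRMSE": nrmse(rul_true, rul_pred),
    "MAPE": mape(rul_true, rul_pred),
}
```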
The evaluation indicators of different HIs are calculated for each gear data set, and the mean values of the evaluation indicators are listed in Table 2. First, two statistical features widely used in PHM 20, RMSE and kurtosis, are chosen as HIs. Then an HI based on manifold learning, i.e., PCA, is constructed. Finally, HIs are constructed by neural networks, i.e., deep belief network (DBN) 33 and variational autoencoder (VAE) 24.
In Table 2, the evaluation indicators show that the HIs generated by the diffusion model consistently outperform the other HIs across the gear data sets. Notably, the monotonicity and the comprehensive indicator of the diffusion model-reconstructed HI reach 0.955 and 0.921, respectively. This signifies that the HI constructed through the diffusion model captures and reflects the degradation trend in the gear data sets. The comparison across different HIs reveals that those generated by DBN, VAE, and the diffusion model surpass those based on PCA, RMSE, and kurtosis. This suggests that HIs constructed by neural networks are more flexible than HIs with fixed patterns, although not all of them are ideal for reflecting the degradation trend in gear data sets. Among them, the diffusion model delivers the strongest evaluation results, which highlights its generalization ability and indicates that the HIs it produces are well suited for assessing health status in gear data sets. Consequently, the HIs constructed by the diffusion model effectively and reliably capture the degradation trend in gear systems.
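The monotonicity indicator mentioned above is not defined in this text. One commonly used definition, assumed here for illustration, is the absolute difference between the numbers of positive and negative increments divided by the number of increments, so that a strictly increasing or strictly decreasing curve scores 1. The comprehensive indicator is not sketched because its composition is not given. The HI values below are hypothetical.

```python
import numpy as np

def monotonicity(hi):
    """|#positive increments - #negative increments| / (n - 1); an assumed, common definition."""
    d = np.diff(np.asarray(hi, dtype=float))
    if d.size == 0:
        raise ValueError("need at least two HI points")
    return abs(int(np.sum(d > 0)) - int(np.sum(d < 0))) / d.size

noisy = monotonicity([1.0, 0.9, 0.95, 0.8, 0.7, 0.6])  # one upward fluctuation among five steps
clean = monotonicity([1.0, 0.8, 0.6, 0.4])             # strictly declining curve
```

Under this definition a smoother, fluctuation-free HI scores closer to 1, which is the property the diffusion-model HI is reported to have.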
Using the small-sample life prediction method, the proposed AGMLSTM is compared with classical models (LSTM, GRU) and published deep learning models, i.e., gated dual attention unit (GDAU) 20, ON-LSTM 21, and Cocktail LSTM (CLSTM) 24, for RUL prediction on the four gear data sets. To compare the prediction accuracy and robustness of each method, grid search is used to obtain the optimal hyper-parameters for each model, and then all tuned networks are tested 10 times on each gear data set. The prediction task for the comparative experiment is set as predicting 60 HI points (1-h RUL). Based on the experimental prediction results, MAE, NRMSE, MAPE, and Score are used to quantitatively evaluate the prediction accuracy, as shown in Fig. 11.
As illustrated in Fig. 11, the proposed AGMLSTM model outperforms its counterparts in predicting RUL. This observation underscores the impact of incorporating comprehensive ordered information, especially when attention mechanisms are employed at the hidden layer level. The attention mechanisms help the network handle data heterogeneity, leading to a marked improvement in RUL estimation accuracy. Based on the actual gear tests, the advantage of the proposed RUL prediction method is shown by MAE, NRMSE, MAPE, and Score, with improvements of 33%, 40%, 17%, and 8%, respectively, compared with the state of the art. Consequently, the proposed model is well suited to the precise prediction of gear remaining useful life, owing to its use of ordered information and attention-guided learning.
AGMLSTM and CLSTM refine the mixed hierarchy through fine-grained processing based on the introduced main and auxiliary gating mechanisms. The distinction is that AGMLSTM employs an attention mechanism for hierarchical localization. Consequently, while AGMLSTM and CLSTM achieve better RUL prediction accuracy than ON-LSTM and GDAU, they come with an increased parameter count. With the same number of hidden layer neurons $L_n$, AGMLSTM increases the parameter count by $8 \times L_n$ compared to ON-LSTM and by $16 \times L_n$ compared to GDAU, and is approximately equivalent to CLSTM. To provide a more intuitive representation of the network's computational complexity, we measured the time required for each iteration during training on the same computer, as shown in Table 3.
From Table 3, it is evident that AGMLSTM and CLSTM incur a higher time cost than ON-LSTM. This is attributed to the different hierarchical learning mechanisms these models employ for input processing, with additional gating units introducing more network parameters. The GDAU, which incorporates dual attention gates, exhibits a similar phenomenon. It should also be noted that the training phase is offline, and during the online prediction phase, the trained AGMLSTM incurs a prediction time of only $7.8 \times 10^{-5}$ s. Hence, the prediction time overhead of AGMLSTM is deemed acceptable considering its superior long-term RUL prediction accuracy.
Based on the above analysis, the rational and comprehensive use of ordered information is crucial for enhancing the accuracy of gear RUL prediction, especially in cases where known samples contain less gear degradation information.Therefore, the proposed method AGMLSTM, guided by an attention mechanism for multi-hierarchy partitioning, effectively extracts more gear state degradation information, resulting in superior overall RUL prediction performance compared to other methods.
To illustrate the robustness of the proposed small-sample prediction method, data set 3 is used as a case study, and AGMLSTM is applied to RUL prediction across different forecast horizons. The training set consists of the known data from the initial segment, and the validation set consists of the unknown data from the subsequent portion. The predictive capability of AGMLSTM is analyzed on data set 3 for horizons of 30, 60, and 90 HIs. As shown in Figs. 12, 13 and 14, prediction performance improves as the number of known HIs increases, and the estimated health indicator points converge towards their true values. The overall agreement between predicted and actual values across the forecast horizons supports the effectiveness and robustness of the AGMLSTM model in gear RUL prediction and its potential for real-world applications.
In Fig. 15, the performance of AGMLSTM in predicting RUL at varying numbers of known health indicator points is assessed using the MAE. The MAE values decline consistently as the number of known health indicator points rises. For a prediction of 30 health indicator points, the RUL prediction has a percentage error of only 5%. With 60 predicted health indicator points, the percentage error increases marginally to 8%. Increasing the number of known HIs brings more HIs containing fault information into model training, which allows the model to learn a broader range of fault trends and progressively improves its predictive capability. These outcomes indicate that AGMLSTM performs well in long-term RUL prediction. To further test long-term RUL estimation, 90 health indicator points are predicted, as illustrated in Fig. 14. Despite a 25% error in the computed result, this attempt shows that AGMLSTM retains predictive ability for long-horizon gear RUL scenarios.

Conclusion
This study introduces a methodology for gear RUL prediction that constructs HIs through a diffusion model and couples them with the AGMLSTM predictor. Leveraging the temporal and frequency characteristics of vibration measurements, the diffusion model provides a distinctive gear HI. This HI serves as the input to AGMLSTM, a predictor designed to comprehensively and reasonably mine ordered information for precise gear RUL forecasts. The incorporation of rich ordered information strengthens the feature extraction capability of the predictor, leading to a substantial improvement in RUL prediction accuracy. Validation on real gear tests demonstrates the superior performance of the proposed RUL prediction method. On widely accepted evaluation metrics, the approach achieves 8 on MAE, 0.3 on NRMSE, 0.1 on MAPE, and 0.52 on Score, improvements of 33%, 40%, 17%, and 8%, respectively, compared to state-of-the-art methods. The proposed methodology primarily addresses the RUL under conditions of single-tooth breakage or pitting failure. However, in practical engineering applications, failures frequently involve the coupling of multiple faults. Therefore, developing a methodology for predicting the RUL in cases of complex gearbox failure is of significant importance and will be a key focus of our future research.

Figure 1. The details of the diffusion model.

Here $i^1_t$ and $i^2_t$ denote the main and auxiliary input gates, and $f^1_t$ and $f^2_t$ denote the main and auxiliary forget gates.

Figure 7. The flowchart of the proposed RUL prediction approach.

Figure 10. Comparison of predictive ability under different optimizers.

Figure 11. The gear RUL estimation performance of different methods.

Figure 12. Prediction illustration for 30 predicted points of data 3.

Figure 13. Prediction illustration for 60 predicted points of data 3.

Figure 14. Prediction illustration for 90 predicted points of data 3.

Figure 15. MAEs of RUL prediction results under different known HI points.

Table 1. Description of data.

Table 3. The complexity analysis of models.